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Abstract. 

Objective. The main goal of this work is to develop a model for multi-sensor signals 
such as MEG or EEG signals, that accounts for the inter-trial variability, suitable 
for corresponding binary classification problems. An important constraint is that 
the model be simple enough to handle small size and unbalanced datasets, as often 
encountered in BCI type experiments. 

Approach. The method involves linear mixed effects statistical model, wavelet 
transform and spatial filtering, and aims at the characterization of localized 
discriminant features in multi-sensor signals. After discrete wavelet transform and 
spatial filtering, a projection onto the relevant wavelet and spatial channels subspaces 
is used for dimension reduction. The projected signals are then decomposed as the sum 
of a signal of interest (i.e. discriminant) and background noise, using a very simple 
Gaussian linear mixed model. 

Main results. Thanks to the simplicity of the model, the corresponding parameter 
estimation problem is simplified. Robust estimates of class-covariance matrices are 
obtained from small sample sizes and an effective Bayes plug-in classifier is derived. 
The approach is applied to the detection of error potentials in multichannel EEG data, 
in a very unbalanced situation (detection of rare events). Glassification results prove 
the relevance of the proposed approach in such a context. 

Significance. The combination of linear mixed model, wavelet transform and spatial 
filtering for EEG classification is, to the best of our knowledge, an original approach, 
which is proven to be effective. This paper improves on earlier results on similar 
problems, and the three main ingredients all play an important role. 


Submitted to; Journal of Neural Engineering 



Detecting EEC Evoked Potential using a Wavelet Domain Linear Mixed Model 2 

1. Introduction 

Electro- and Magneto-encephalography (respectively, EEG and MEG) are of the 
rare techniqnes allowing non-invasive brain investigation with an excellent temporal 
resolution and, under some conditions, a fairly good spatial one. One main limitation, 
however, is that extracting the brain activity of interest from background activity and 
noise usually requires averaging a large number of repetitions of the signal recorded 
in the “same” condition (for example, averaging several epochs of signal following 
the same repeated stimulus). Such an averaging, however, distorts the signal and 
prevents precise investigation of the dynamics of the underlying processes. Indeed, 
it has long been known that averaging has non-linear impact on the latencies estimation 
pSl EHl [H]. For example, the latency of the onset of an activity measured on the 
average largely underestimates the real mean of the individual onsets [521 iH E2], 
making its use problematic for direct comparison with chronometric variables such 
as Reaction Time (RT) [TTl ITT] . The averaging process also prevents from analyzing 
learning and/or adaptation effects across trials repetitions |16]. More generally averaging 
trials eliminates the signal of interest variability, although the latter contains important 
information that can be useful in various contexts, from signal interpretation to 
classihcation. 

For these reasons, methods for single-trial EEG analysis and classihcation have 
been developed during the last decades. To reduce the impact of noise and background 
activity in single-trial analysis, common strategies have been to extract more elementary 
parts concentrating the signal of interest, either in time, in frequency, or in space. Such 
approaches include, among others, selection of time domains, frequency hltering, wavelet 
decomposition [l5l 156] , adaptive basis selection [MIS], matching pursuit mn or blind 
source separation [29] . In those approaches, one aims at simply getting rid of the 
variability induced by the background activity and the noise, hoping that the remaining 
variability will only be attributable to the signal of interest. 

Alternative strategies can be based on explicit signal modelling. For example the 
Linear Discriminant Analysis (LDA) which is one of the most popular classihers for EEG 
single-trials detection (and for brain computer interface - BGI - applications, see [ST] ) 
can be interpreted in terms of very simple Gaussian mixture model. In this setting LDA 
assumes a class-independant covariance matrix. In a recent overview [^ , Blankertz and 
coworkers proposed a regularized Gaussian mixture model for event related potentials 
(ERPs). In the latter, the single-trial is written as the sum of a signal of interest which 
is approximated as constant over trials and a random background activity modelled 
as a Gaussian noise. The inter-trial variability is not taken into account. The class- 
covariance matrices are assumed to be equal, which leads to a very simple detection 
algorithm that turns out to be very efficient for binary classification tasks. 

However when this equality assumption is not valid, the problem becomes more 
complex, especially when the two classes are unbalanced. In that context, standard 
classifiers, such as LDA, may fail as the estimated common covariance matrix is largely 
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determined by the majority class m IHT] , Taking into account the difference of the 
class-covariance matrices leads to a quadratic classihcation rule (QDA) and requires the 
estimation of a covariance matrix for each class. This leads to a robustness problem in 
the estimates, the more so when the size of the minority class is small. 

In BCI and more generally in EEG experiments, such an unbalanced situation is 
not unfrequent. For example in BCI, the P300 speller protocol naturally generates two 
unbalanced classes [IH] . Classical P300 spellers, with a 6 x 6 matrix containing all letters 
and characters, yield an unbalanced datasets composed of 1/6 of ERP signals and of 5/6 
of noERP. Another example of unbalanced classes can be found in RT tasks in which 
the participant has to respond as fast as possible to the appearance of predetermined 
stimulation. For example, standard experimental psychology protocols make use of 
biased probabilities across experimental conditions (see e.g. |13]). Besides manipulated 
factors, biased probability might also be the result of participants behavior, such as 
errors which are typically much lower than correct trials. 

In this paper, we target more specifically situations where the class-covariance 
matrices differ and available datasets are unbalanced and of limited size. We propose 
a wavelet domain Caussian linear mixed model (termed LMM in the following) for the 
binary signal classification problem. The model expresses each single-trial as a sum of 
a class-dependent signal of interest and a background activity as in [2H1 [2Z] • The signal 
of interest is modelled as a multivariate Caussian vector whose covariance matrix, that 
describes the inter-trial variability, depends on a user-specified design matrix. Both 
mean and design matrix are class-dependent. The background signal is modelled as 
a class-independent Caussian white noise. The resulting model is characterized by a 
remarkably small number of parameters. 

The application context of the current paper is the detection of evoked potentials 
in M/EEC signals. The above described model turns out to be relevant for such signals 
when suitable preprocessings are performed, namely wavelet based time decorrelation, 
spatial filtering and corresponding dimension reductions. The procedure is more 
specifically applied to the problem of error negativity (ErrP) detection and analysis. 
Corresponding classihcation results compare very favorably with standard linear 
classihers for small sample size datasets. 

One of the main contributions of this work is a model for subject specihc M/EEC 
signals, that belongs to the family of linear mixed models. Mixed models offer a 
rich and hexible framework for describing and quantifying variability and produce 
robust estimates of class-covariance matrices from small datasets. Such models have 
been considered in EEC applications in many different contexts. Let us for example 
mention [2| where an introduction to mixed model is provided and possible applications 
in psychology and neuroimaging are discussed. Mixed models have also recently been 
used by other authors in bayesian framework for electrophysiology applications. In [T3] , 
a functional mixed model is introduced for the purpose of regression analysis of ERP’s 
data and where the model is combined with discrete wavelet transform and sparsity- 
inducing prior distribution. In [HI a subject-independent classiher is proposed using a 
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£i-penalized linear mixed model. Within-subject and between-subject variabilities are 
modelled through random and hxed terms and the method is applied to BCI achieving 
subject-to-subject transfer. The present work, inspired on the work of Huang and 
coworkers [28l[27], focuses on a subject specihc model taking into account the variability 
across repetitions of the same experiment. The model proposed here is constructed 
directly in wavelet domain instead of time domain, and also differs from [28l EZ] in both 
hxed and random parts. In our case the hxed part corresponds to the class-average 
whereas they used projection onto the hrst principal components. The modelling of 
inter-trial variability is diherent too. All together this results in a simplihed Gaussian 
linear mixed model with less parameters to estimate. Moreover the application of the 
model to the binary classihcation problem is diherent. In our work we directly exploit 
the estimates of class-covariance matrices in a Bayes classiher while Huang et al pSl E7] 
use a likelihood test based on the predicted single-trial signals. Let us stress that we 
only address and model here the inter-trial variability. Even though the dataset used 
for validation involves several subjects, they are all treated independently and we do 
not attempt to model the inter-subject variability. 

This paper is organized as follows. A short overview for linear and quadratic 
discriminant methods for unbalanced data is given in section The model and 
the statistical procedure are described in section while section is devoted to the 
application to ErrP data. Discussion and conclusions are given in sectionand section]^ 
respectively. 

2. A short overview of linear and quadratic discriminant analysis for 
unbalanced data 

We briehy discuss here the main issues encountered when using linear and quadratic 
discriminant analysis for unbalanced and small size datasets, and describe some classical 
solutions that will be of interest for this work. 


2.1. Linear and Quadratic discriminant analysis 


Let us take the probabilistic point of view, and derive Linear Discriminant Analysis 
(LDA) and Quadratic Discriminant Analysis (QDA) from the decision problem in the 
case of a Gaussian mixture, see e.g. [23] . Observations are considered as multivariate 
Gaussian draws with respective prior probabilities (with ~ 1) probability 

density function (pdf): 


r(^) = 


exp 


(27r)'^/2|E'=|i/2 

where /i'^ G is the class-mean, G 


.-(x-/iQ'(SQ-i(x-pQ 


( 1 ) 


\)dxd \ 


is the class-covariance matrix (assumed to 


be invertible) and its determinant. 

This leads to the discriminant function, dehned for each class c: 
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Let Xi denote the i-th sample to classify. Xi a d-dimensional vector. The decision rule is 
given by assigning Xi to the class with the maximal 5^{xi). In the general case, the class- 
covariances are different, and the decision should be based on quadratic functions of 
Xi (thus the name QDA). When the class-covariances are assumed to be equal (S'^ = E 
for all c) the decision is based on linear functions of Xi (LDA). 

In practice the parameters of the Gaussian distribution in each class are unknown 
and must be estimated, which is problematic in the case of small size and/or high 
dimensional datasets. We address this problem below, focusing on binary classihcation 
(c G {0,1}) and unbalanced datasets. We will denote by A^° and the corresponding 
sample sizes of datasets from which the estimation is done, and set N = + N^. 

and E will denote respectively the sample estimates of class-covariance matrix E'^ and 
common covariance matrix E dehned as follows: 


Ar= 


Ar= 


E" = 


N <^-1 


—7 ~ — x^y , where 


= 


2=1 






2=1 


E - —E° —E^ 


( 3 ) 

( 4 ) 


Two cases are to be considered. 


Case 1: Egual class-covariance matrices. When E° = E^ = E, the decision is based 
upon the sign of w'xi, where w = —/i^). The quality of the decision relies heavily 

on the quality of the estimation of the covariance matrix E and its inverse, which may 
turn out to be poor for small datasets and/or in high-dimensional situations. In addition, 
low quality estimate for E may lead to invertibility problems, which makes the classiher 
ill-dehned. Solutions to such problems are discussed below. 

Let us note that for equal class-covariance matrices, unbalancedness of the datasets 
does not introduce additional difficulty as E is estimated from the complete dataset. 

Case 2: Unegual class-covariance matrices. In this case, two covariance matrices must 
be estimated, which amplihes the above mentioned difficulty, particularly when the 
dataset is small and unbalanced enough so that at least one of the two covariance 
matrices is poorly estimated. This is why LDA is generally preferred, even when the 
true class-covariances are different. However, when the datasets are unbalanced, the 
estimation tends to focus on the prevalent class and to ignore the rare one: the covariance 
estimate S given in (|^ is dominated by the majority class, so that S S° (in the case 
where N^). 

In such a situation, a common solution consists in re-balancing datasets using 
sampling methods [6lll62lE7|. Over-sampling increases the size of the minority class 
(without information gain) while under-sampling reduces the size of the majority class 
(which may result in prejudicial information loss). 
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We stick here to binary classification problem, in the small sample size {d 3> N) and 
unbalanced {N^ N^) situation. In such case, estimated covariance matrices are 

often non-invertible, and some additional assumptions or regularization are needed. We 
review a few solutions that will be of interest for us, that ensure invertibility by enforcing 
diagonal dominance of the covariance matrices. 


(i) Diagonal covariance . If data are assumed to be decorrelated, the covariance matrix 
is diagonal and only the d variances are to be estimated (leading to Diagonal 
LDA, DLDA for short). Under the multivariate Gaussian law, this assumption 
corresponds to independence assumption. However it is known that even for 
correlated data (which is the general situation) better classification results are often 
obtained when correlations are ignored, thus replacing S by its diagonal (leading 
to the so-called naive Bayes classifier). This is supported by asymptotic arguments 
{d S> n) as well as experimental results (see e.g. [ 6 ] and references therein). In 
particular if the off-diagonal elements of the covariance matrix are expected to be 
nearly zero, it is usual to estimate them by zero. Otherwise in some situations, a 
decorrelating transformation can be found that enforces the diagonal dominance of 
the covariance matrix (see section 3.1). 

The diagonal assumption as well as the above considerations can be extended to 
the quadratic situation (leading to the diagonal QDA, DQDA for short). 

ii) Diagonal dominance enforcing regularization. Singularity issues can be overcome 
by shrinking the sample covariance matrix towards a multiple of the identity 
matrix 123. A particular trace preserving shrinkage approach has been proposed 
in in the context of EEG single-trial analysis. The corresponding estimate is a 
weighted average of the sample covariance matrix and the identity matrix 


^( 7 ) = (l-7)S + 7 


tr(U) 

d 




(5) 


where 7 G [ 0 , 1 ] is a shrinkage parameter and tr(S) is the trace of the sample 
covariance matrix, i.e. the sum of its diagonal elements. The standard practice is 
to find the optimal 7 by cross-validation. Blankertz et al give a simple heuristic 
estimate with satisfactory results on EEG data [7]. This method will be called 
Regularized LDA (RDA for short). 


(iii) Explicit covariance modelling. When prior information on data is avalaible, explicit 
models for covariance matrices can sometimes be proposed to reduce the number 
of parameters to estimate. We describe below an example where a suitable 
transformation combined with a linear mixed model leads to diagonal dominant 
class-covariance matrices characterized by (2 -|- 2 x d) parameters only. 
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3. Method 

We propose a classification method for multisensor signals to discriminate between two 
experimental conditions in view of application to EEG and BCI. The signals are recorded 
over M sensors in a fixed time-period with T samples. For a given participant, we obtain 
trials per class, c G {0,1} being the class label, with each trial taking the form of 
M time series of T samples each. 

Our procedure is based on three steps. The first step is preprocessing: 
transformations are applied to obtain diagonally dominant sample class-covariance 
matrices and the temporal and spatial dimensions are reduced. The second step 
models variability using a Gaussian linear mixed model to estimate the class-covariance 
matrices. The third step is classification. 

3.1. Decorrelation and dimension reduction 

Raw recorded signals are known to be strongly correlated both across sensors and 
time points. Data are thus preprocessed in order to reduce dimensions and enforce 
the diagonal dominance of class-covariance matrices. 

3.1.1. Time-domain decorrelation using DWT. Signals are recorded as time-series, 
with a given sampling frequency. However, samples are expected to be highly correlated, 
should they correspond to the background activity or to the signal of interest. Therefore, 
it is important to reduce the correlations, which we do through a linear transform, by 
replacing the time samples with the coefficients of the signals expansion with respect to 
a suitably chosen basis. In this work, we limit ourselves to a wavelet basis, generated 
by shifting and rescaling a generic waveform ip. More precisely, in the framework of the 
so-called multiresolution analysis, it is possible to find pairs of functions (0, pj) such that 
suitably rescaled and shifted copies of these form orthonormal bases of signal spaces of 
interest. Without going into mathematical details, for which we refer to e.g. [T3l l36l l55] 
for detailed accounts, and also to [lO] for a pedestrian introduction, this yields multiscale 
signal decompositions as follows. Given a reference (integer) scale index mo, any signal in 
the signal space can be uniquely expanded as time-shifted copies of the scaling function 
at scale 2™° and time-shifted copies of the wavelet, rescaled at smaller scales 2"*, m < mo 
(also called levels). This is expressed mathematically in the form 

n m<mo n 

The coefficients that enter such an expansion are called respectively scaling coefficients 
(for the coefficients Smo,n) and wavelet coefficients (coefficients dm,n) and are easily 
computed as inner products of the signal / with the corresponding basis functions. The 
numerical computation of these coefficients is performed using dedicated fast algorithms 
(see [55]), which leads to the so-called DWT (for discrete wavelet transform). In what 
follows, we will gather scaling and wavelet coefficients of a signal in a unique vector of 
multiscale coefficients. 
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Wavelets are introduced here because of their ability to decorrelate signals. It 
has been found empirically in many application domains (such as image processing) 
that the nonzero (or significant) multiscale coefficients of a correlated signal are often 
far less correlated than the signal, and that the number of significant coefficients is 
therefore much smaller than the signal length. This turns out to be the case for 
the EEG signal considered here. Therefore, a (time-domain) dimension reduction can 
be performed by moving to the multiscale domain using DWT, and setting to zero 
irrelevant multiscale coefficients, namely those coefficients that are numerically negligible 
for all channels. The resulting coefficients are therefore weakly correlated (and the 
corresponding covariance matrix is strongly diagonal dominant). It is worth noticing 
that wavelets only provide approximate decorrelation. Alternatives to wavelets have 
been proposed for improving the decorrelation by seeking optimally decorrelating basis 
(see [59] for detailed account of the best basis approach and [Mill] for the adaptation to 
EEG signals). These techniques are adaptive and, as a consequence, the decomposition 
basis is signal dependent, which is not suitable for the model we are using here. 

In addition to dimension reduction, a level-dependent rescaling is performed on 
the retained multiscale coefficients to correct for variability differences across scales. 
More precisely, at each scale, coefficients are normalized so that they all have the same 
variance. 

As a result of this time (or time-scale) domain processing, each trial i gives rise to 
a multiscale coefficient matrix Xf G (c being the class label), where K is the 

number of retained multiscale coefficients and M being the number of sensors. 

3.1.2. Spatial filtering. The considered signals are multisensor signals, with low spatial 
resolution and large spatial correlations. These can be reduced using spatial filtering. 
In single-trial analysis, spatial hlters are used to improve signal-to-noise ratio while 
reducing both EEG data complexity and spatial dimension. Various spatial filter 
constructions have been proposed, depending on EEG application (see [121 EH] for 
reviews). 

In the present study, we rely on a matrix-based technique inspired by Fisher’s linear 
discriminant method [20], as done in m The main purpose is to identify projections 
in the sensor space that keep the classes as separated as possible while minimizing the 
variance within classes. We introduce the between-class and the within-class covariance 
matrices Sb and Sw^ respectively 

( 6 ) 

c=0 
.. 1 

= , ( 7 ) 

c=0 i=l 

where G and X G are the trial-average in class c and the total trial- 

average matrices. Here, N = is the total number of trials. Notice that Sb and 

Sy/ are both M x M non-negative definite matrices. 
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The method seeks to identify the most discriminant linear combinations of sensors 
by optimizing the following Fisher criterion, seqnentially nnder orthogonality constraints 
between vectors: 

u'Sbu 

max —- 

•ueiR*^ u'bwu 

The solntion is obtained by compnting the eigenvectors decomposition of the matrix 
S^Sb (assuming that Sw is invertible). The eigenvectors are sorted by decreasing 
discriminating power (eigenvalue), out of which the hrst J are selected. The number 
J is chosen according to a criterion based on the cumulative percentage of eigenvalues. 
In most situations, J ^ M, which yields a signihcant spatial dimension reduction. 
The J uncorrelated linear combinations of the M sensors will be called channels in the 
following. 

Let U = [«!,..., uj] denote the matrix whose columns are the selected eigenvectors. 
Each single-trial signal Xf is projected onto the selected subspace spanned by these J 
eigenvectors as follows 


Y‘ = U'X‘ 


(9) 


where each row j of the matrix Yf G consists in the wavelet and scaling function 

coefficients of the j-th channel for single-trial i. Each column of U, denoted Uj, can be 
referred to as spatial filters (see [3HI El la 02]). 


3.2. Model setup 

In this section a single-trial analysis is proposed, in which the variability is modelled 
through a Gaussian linear mixed model |T71 [37] . 

3.2.1. Background and definition. Given multisensor and multi-trial signals fifit), 
with i the trial index and s the sensor index, that are labelled by a class index c, we 
assume that such signals can be modelled as a sum of two independent components: 

• A background activity, assumed to be stationary (in the sense of weakly stationary 
random processes, namely the two hrst order moments are translation invariant) 
and correlated (see e.g. jHT] I. 

• Event related components (also called signals of interest), thus intrinsically non¬ 
stationary, characterized by a class index which labels the cerebral response to the 
events. The signal of interest is further modelled as the sum of a hxed component, 
common to all the trials in the class, and a random trial dependent component (the 
so-called trial random effect). 

In what follows, the signals that will be modelled in such a way are the multi-channel 
multiscale (wavelet and scaling) coefficient arrays Yf G resulting from the 

preprocessing. We will denote by yf G the vector obtained by concatenation of 
all columns of Yf. 
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3.2.2. The statistical model. For a given participant, we consider trials per class, 
c G {0,1} being the class label. After preprocessing, we represent each trial i as a vector 
Hi G where J is the number of channels and K the number of retained wavelet 

and scaling coefficients. In a given class c and according to the additivity assumption 
(see section 3.2.1 above), (i = 1, • • •, N^) is therefore written as 

vt = + T%i + Si , ( 10 ) 

where 


• /i'^ G is the mean vector of class c, 

• bi A/'(0, r^) is a zero-mean Gaussian random variable with variance r^, modelling 
the trial random effect, 

• G is a class-dependent coefficient vector which modulates the impact of the 
trial on each sensor and sample, 

• Si ^ A^( 0 , cr^I jk) is the residual part modelling the background activity where I jk 
denotes the identity matrix of size JK. We further assume that the residual vector 
Si and the trial random effect bi are independent. 


After specihcation of the coefficient vector F'^ (see section 3.2.3 below), the model 


given in (10) becomes a Linear Mixed Model (LMM), such that 


Af (/i^ W) 


where W = r^F" (F'^)' + a%K , 


( 11 ) 


(here (F'^)^ denotes the matrix transpose of F^). 

In summary, each single-trial is decomposed into a hxed part, corresponding to 
the class-mean, a trial random effect bi modulated by the coefficient vector F'^, and a 
random term accounting for the background activity. 

Let us note that the simplicity of the proposed model in wavelet domain permits 
to obtain precise parameter estimates over a small number of trials. Indeed, after 
specifying the coefficient vectors F'^, we only have to estimate four parameters: the two 
mean vectors jp, c G {0,1}, and the variances and cr^. 


Remark 1 (i) Since the class-mean is used as the fixed part, the model described 

in (10) reduces to a mixed-effects ANOVA model ISD, where EEG-class is the hxed 
effect factor and trial is the random effect factor. This allows grounding the model 
on very solid foundations, and yields a more straightforward interpretation of the 
results. 


(ii) In the case of the between-trial variability is negligible, there is no random effect 
T^bi in (10) and the model corresponds to a simple linear one, as described in [7]. 


(iii) A related approach, that actually inspired the present work, was proposed by 
Huang et al in |2Sl EZ]. Our modelling, however, differs in three major points. 
First, we consider a linear mixed model on wavelet domain more adapted to the 
assumption of the noise decorrelation. Second, the hxed part is set to the class- 
mean and does not involve any projection. Third, the covariance matrix depends on 
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channel, sample and class while Hnang et al propose dependence on trial, sample 
and class. We stress that onr model is signihcantly simpler, avoids nnnecessary 
approximations and reqnires less parameter estimations. 

(iv) Sticking to snch a simple model is motivated by nse cases where few trials are 
available. This is the case nnder some experimental conditions where the nnmber of 
trials of interest is low, either becanse of the experimental design (e.g. probability 
bias), or because rare behaviour of the participants ( e.g. errors in RT tasks). 
Another example is BCI type data, for which the training set has to be limited 
to reasonable size, and to which high complexity models therefore can hardly be 
htted. 


3.2.3. Random part coefficient vector. To completely specify the model in (10), the 
(class-dependent) coefficient vectors T'^ G have to be chosen. Let us note that the 
components of these vectors modulate the random effects bi in the same way for all 
trials in the same class. T'^ is a design parameter to be fixed a priori. The model itself 
doesn’t assume any specihc form, the choice of T'’ is problem dependent, and usually 
relies on prior information. In the considered situation, such prior information is not 
available, the choice of T'’ was guided by preliminary exploration of the data. We refer 
to section 14.2.21 for details. 


3.3. Classification procedure 

Each single-trial yi G to classify is assigned to a class c according to the maximum 
a posteriori calculated using the Bayes formula: 

max P(Class = Cyf). (12) 

ce{o,i} 


Based on the model proposed in (10), yi is assumed to be distributed in each class c 
according to the multivariate Gaussian distribution 7V(/i‘’, V^) where ffi is the class-mean 
and V‘^ is the class covariance matrix such that V‘^ = r^T'’ (T'’)^ -|- ffiljx. 

Under these assumptions, the optimal classihcation is obtained with the following 
quadratic discriminant function for each class c : 

gc(Pi) = (Pi - M7(VT\P^ - - 21n(p^) + In |W|, (13) 

which yields to the following binary classifier : 

^{vi) = 9o{yi) - 9i{yi) ■ (14) 

The trial y^ is assigned to class 0 if (5(|/i) < 0, to class 1 otherwise. 

Remark 2 Let us note that the resulting classifier is a particular case of the QDA 


classiher presented in section |2.1[ In the situation considered here, the class-covariance 
matrix S'’ is decomposed in the following simple form (h'’)^ -|- Therefore, 

after specifying the coefficient vector T'’ G for each class, only and have to 
be estimated, which makes QDA tractable in this situation. 
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3 . 4 . A specific methodology for EEC single-trial classification 

We shortly summarize the main steps of the LMM-based classification procedure 
(hereafter termed LMMC), which consists in two stages: a training step to estimate the 
model parameters and the test step to classify the EEG single-trials. For the reasons 
exposed above, in many contexts as BCI, the training set shall be as small as possible 
to get sufficiently precise estimates to correctly classify test set. 


3 . 4 . 1 . Training Step. 


(i) Preprocessing. 

First whitening and time-domain reduction are performed through DWT. The 
K selected multiscale coefficients are scaled to set the coefficient variability 
independent of the decomposition level. Then a spatial dimension reduction 
is performed. Spatial filters are estimated using a matrix-based Fisher’s linear 
discriminant and only the J most discriminant channels are selected. 



Wavelet-based modelling. 

Single-trials in the multiscale domain are modelled as in (10), with the class-mean 
as a fixed component and the random component depending on class, channel and 
trial. In the latter, the class and channel dependence is given by the coefficient 
vectors T^ and T^ that must be specihed a priori, according to (18) in this model. 


LMM parameter estimation. 

After fixing T^ a priori, the complete parameter 9 = /i^, r^, cr^) G 

of the LMM can be estimated by likelihood-based methods using the assumption 
that h and £ are independent and normally distributed. We use Restricted 
Maximum Likelihood (REML) method that provides unbiased variance estimates 
(see e.g. m)- The estimates will be denoted hy 9 = (/i°, r^, d^). 


3 . 4 . 2 . Test Step. 

(i) Preprocessing. 

The transformations are the same as in the training step. 

(ii) Classification rule. 

The classiher 6{yi) is obtained from the plug-in estimates of the discriminant 
functions given in (13). In the expressions (11) to (14) the parameters 
are replaced by their estimates computed during the training step. 

From the variance component estimates and d^, we obtain the estimation of the 
class-covariance matrix described in the model (0: 




.JK 


( 15 ) 
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3.5. Single-trial reconstruction 

In addition, an estimate can be obtained for each single-trial i in class c through the 
Henderson mixed model equations (see e.g. ED) : 

k = f\r)‘(VT\v. - if ) ■ (16) 

This, in turns, yields an estimate for the signal of interest as follow : 

yt = fi^ + Wh . (17) 

Notice that unlike Huang et al |2Hl 12^, this estimation is a sub-product of the 
model allowing single-trial visualization and does not play any role in the classihcation 
procedure. 

4. Application to error potentials detection 
4 . 1 . ErrP Data and preprocessing 

We now consider the problem of modelling and detecting error potentials using the 
approach developed above on the dataset reported in [19]. We start with a brief 
description of main aspects of the experiment that are relevant for the present study. 
Details on recordings and EEG data preprocessing are not exposed here since they had 
been largely detailed in [IH] and are beyond the scope of the present paper. 



Figure 1. Grand averages (FCz) of error (red) and correct (blue) trials for 
monopolar data. 


4 . 1 . 1 . Experiment and signal acguisition. The dataset includes 10 participants 
performing an Eriksen’s flanker task [TB]. Each trial consists in the identihcation of 
a central letter (target) embedded in a set of three letters. Participants were asked to 
respond to the target letter by pressing a button with either the left hand or right hand. 
The signal was recorded with 64 scalp electrodes and after preprocessing (sampling to 
1024 Hz and artifacts removal), data were downsampled to 256 Hz. 
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The selected trials were segmented into epochs of 800 ms from —400 ms to +400 ms, 
were zero corresponds to the electromyogram (EMG) onset that triggered the response. 
Each dataset in the training step correspond to a three-dimensional array as we consider 
trials in both classes c = 0 (correct trials) and c = 1 (errors) over 64 electrodes and 
204 time-points. 

J 1 ..I. 2 . Error/Correct trials classification: an unbalanced dataset. For each participant, 
we consider two categories of trials: error and correct. In this current application we do 
not take into acount the third class discussed in [IS] which corresponds to partial errors. 
In hgure [T| differences on monopolar data between error and correct are observable on 
grand averages (over trials and subjects). For errors, a clear negativity is observable 
around 150 ms after the EMG onset (zero of time) whereas on correct trials a large 
positive wave occurs just after EMG onset, but no negativity is visible [|] 

The number of trials per class differs greatly for each participant. In table 
the total number of trials for error and correct responses is given. The percentage 
of error trials approximately ranges from 2% to 12%, the dataset is therefore highly 
unbalanced, errors forming the minority class. The main goal here is to achieve single¬ 
trial classification on this unbalanced dataset. 

For an illustrative purpose throughout the section, results are presented for 
participant A. Results concerning the other participants are given in Supplementary 
Data. 


Table 1. Total number of trials per participant for error and correct responses. 






Participant 






A 

B 

C 

D E 

F 

G 

H 

I 

J 

Error 

130 

105 

43 

39 18 

28 

94 

145 

100 

63 

Correct 

1376 

1575 

1467 

1735 690 

759 

1844 

1238 

1167 

907 


4 . 1 . 3 . Preprocessing. The following two-step preprocessing for spatial and temporal 
decorrelation and dimension reduction is applied to the ErrP data. 

Decorrelation and time-domain reduction. DWT was performed using a Daubechies 
hlter D6 (see [131 EHl ES] for details) with 5 decomposition levels. Given the sampling 
frequency, those 5 levels correspond to the following frequency bands: 64 — 128Hz, 
32 — 16 — 32E[z, 8 — 16Hz and 4 — SHz respectively, corresponding to wavelet 

f On correct trials, Current Source Density (CSD) analysis has revealed that a negative activity similar 
to the one observed on errors exists, but with a much smaller amplitude |49j . In the present context, 
since discrimination was the main goal, we did not resort to CSD analysis for maximizing the difference 
between correct and errors. This is why such a small negative wave is not observable on correct trials 
in this case 
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coefficients, the lowest frequency band (0 — AHz) being encoded in scaling coefficients 
(see section 3.1.1). 

Prior to DWT, signals were first extended using zero-padding. Doing so, we obtain 
a set of signals with appropriate length (for the DWT implementation which we use [9], it 
is convenient that the signal length be a power of two). Zero-padding is also an extension 
method which permits to avoid misinterpretation about the behaviour of signals beyond 
boundaries. 


Remark 3 The choice of D6 is the result of a tradeoff between localization and 
smoothness for the resulting wavelet. Wavelets with larger support generate more 
important boundary effects (we recall that in the dataset under consideration, signals are 
epoched, so that boundary effects have to be accounted for). Shorter filters yield wavelets 
with lower smoothness, so that the adjusted signals can also lack smoothness. For 
example, adjusted signals obtained using the Haar wavelet are discontinuous, which we 
didn’t find satisfactory, while adjusted signals obtained with D4 are non differentiable. 
Further tests were made using Daubechies filters with various lengths. Corresponding 
results (which can be found in Supplementary data) are fully consistent with the results 
presented here. 



Figure 2. Illustration of the relevance of DWT and multiscale coefficients selection. 
Figures (a), (b) and (c) represent error and correct responses averages over FCz. 
(a) Error (red) and correct responses (blue) averages obtained from the 24 selected 
coefficients projected back to time domain and compared with time-courses averages 
(dashed lines). Partial reconstructions based on scaling (b) and wavelet coefficients (c) 
are also represented separately. 

Second, dimension reduction was achieved by removing the first three 
decomposition levels, that contain low amplitude high frequency phenomena in the 
signals. The truncation of the EEG wavelet coefficients sequence to 2 decomposition 
levels amounts to a projection onto an appropriate subspace pQ that keeps the relevant 
information in the remaining wavelet and scaling coefficients. 

Finally, keeping in mind that the signals were zero-padded, boundary wavelet 
coefficients that originate from the zero-padding are excluded from the statistical 
analysis, as they turn out to exhibit a behaviour different from the relevant ones. 
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More precisely, coefficients likely to be sensitive to bonndary effects were not taken into 
acconnt, namely the 2 bonndary scaling coefficients, the 2 bonndary wavelet coefficients 
at coarsest scale, and the 4 bonndary wavelet coefficients at hner scale. After this step, 
K = 2A coefficients are selected: 12 + 6 wavelet coefficients for decomposition levels 4 
and 5 respectively and 6 scaling coefficients. 

Fignre illnstrates the relevance of mnltiscale coefficients selection. The 24 
coefficients remaining after DWT were projected back to the temporal space using an 
inverse DWT (IDWT). As seen in hgure a), errors and correct responses averages 
obtained from the selected coefficients are very similar to the 204 time-points averages. 
In addition, one can see that scaling coefficients (b) capture the general waveform while 
wavelet coefficients concentrate the details (c). 

In the following, the selected multiscale coefficients are referred to as multiscale 
features. 



Figure 3. Spatial filtering results for participant A. (a) Using a cut-off of 70% of 
discriminative power, J = 8 filters are selected based on the screeplot of eigenvalues, 
(b), (c), (d) display the three first selected filters, ordered by decreasing order of 
discriminative power. 


Spatial filtering. Using the matrix-based Fisher’s linear discriminant proposed in 
section |3.1.2[ spatial hltering consists in identifying the most discriminant hlters to 


separate errors and correct responses. The number J of selected hlters depends on the 
cumulative percentage of discriminative power, hxed to 70% in the present study for all 
subjects. This choice is somewhat arbitrary. We have chosen to stick to a very simple 
criterion, but other criteria could have been used as well. For example, Kaiser’s rule 
has been tested on our dataset, and results were not satisfactory in the sense that the 
number of selected channels was larger (and the computational load was increased), 
without improvement of the classihcation rates. 

Figure l^a) represents the screeplot, which plots the ordered eigenvalues, associated 
with the 64 electrodes and out of which only 8 hlters are needed to concentrate 70% 
of the discriminative power. Figure]^ (b), (c), (d) correspond to topographical maps 
of the hrst three spatial hlters i.e. the weights of the electrodes in the corresponding 
projections. The contributions of multiscale features in each class can be visualized 
using their spatial signature dehned as their backprojection onto the sensor space: for 
each multiscale index, the pseudo-inverse of the spatial hlter matrix is applied to the 
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vector of the corresponding coefficient values at all channels. For the sake of illustration, 
we display in hgure the spatial signatures of some multiscale features averaged across 
trials. 



Seal. 4 Seal. 6 Wav.lO Wav. 12 Wav. 22 

Figure 4. Topographical maps of spatial signatures of five multiscale features 
averaged across trials. 

Here we have chosen to display hve features corresponding to scaling coefficients 
4 and 6 and to wavelet coefficient 10, 12 and 22. These specihe multiscale coefficients 
were chosen for their relevance for error, correct or both classes: on average they appear 
as emergent, large magnitude coefficients (see hgure below). For example, scaling 
coefficient 4 has the largest average magnitude in both classes and was found to be 
highly discriminant. This behaviour is rehected by the corresponding spatial signatures 
(dehned above) which are strongly negatively correlated. Scaling coefficient 6 is relevant 
only for the correct class and so is its spatial signature. Wavelet coefficients 10 and 22 
are only relevant for error whereas the coefficient 12 is not discriminant. In agreement 
with a large literature on ErrP, the spatial signatures for the relevant multiscale features 
mainly loads on fronto-central sensors. 

4-2. Modelling ErrP single-trials 

After applying spatial and temporal hlters, the extracted features from ErrP data are 
decorrelated and dimensions are signihcantly reduced. We now consider multichannel 
trials in the wavelet domain, such that Yf G with K = 24 and J M (for the 

training set example J = 8, see hgure |^. Modelling ErrP single-trials is based on the 
columnwise vectorized trial y^. 

4 . 2 . 1 . Testing class-covariance matrices ineguality. We display in hgure the sample 
covariance matrix for each class [error and correct), calculated after preprocessing. For 
simplicity the hgure is limited to the hrst two channels. In addition for the relevance 
of the visual comparison, the two matrices have been computed on datasets of equal 
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size (actually maximal possible size, to ensure maximum precision). Visual inspection 
reveals different diagonals and noticeable non-diagonal terms in the sample covariance 
matrix of the error class. The significance of the difference between these two matrices 
is evaluated quantitatively using Box’s M test [2]. The latter clearly rejects the class- 
covariance matrices equality null hypothesis (p-value -C 10“®). 

Importantly, this conclusion differs from the claim of [7], who did not see any 
noticeable difference between sample class-covariance matrices (on a different EEG 
dataset). Although such a finding may be dataset dependent, we stress that for the 
dataset considered here, the significant difference results from time and space dimension 
reductions, which yield an important denoising. In the absence of dimension reductions 
(for example raw data), the sample covariance matrices are generally singular and the 
Box’s M test cannot be performed. 



Figure 5. Representation of 48 x 48 sample class-covariance matrices for 24 multiscale 
coefficients and 2 channels : error class (a) and correct class (b). 


4-2.2. Choice of the random part coefficient vector T'’. As mentioned in section 3.2.3 
the choice of T'’ is problem dependent. In our application no prior information was 
available that could help specifying the form of T'’, we thus relied on preliminary data 
exploration. We plot in hgure the mean in absolute value and the standard deviation 
computed over all trials for participant A on the first channel in the two classes. This 
reveals a monotonic relationship between the amplitude and variability of the signal of 
interest, with an additional offset effect. The latter can be interpreted as originating 
from random noise £ in (10). Similar results are obtained for the other participants and 
channels. 

In the present work, we take the simplest relationship, namely a linear one, by 
setting the vector T”’ to the average over trials in class c: 


1 




W = — V 


2=1 


(18) 
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(a) 



(b) 



Figure 6. Illustration of the monotonic relationship between amplitude and 
variability in ErrP-related EEG signals: mean in absolute value {abs(mean)) and to 
the standard devation [std) both calculated across all trials of the first channel, (a): 
error ; (b): correct. 


4-2.3. Variance components estimation. We now explore one of the main assumptions 
of our study, namely the single-trial signal of interest can be written as the sum of 
two components: a class-dependent fixed term (the class-mean) plus a trial-dependent 
random term (the so-called random effect). The hypothesis to be tested is the 
signihcance of the random effect variance r^. A negative answer would indicate that a 
random effect is not needed. In that case, a classical linear model as proposed in 
would be more appropriate. 

The implementation of the mixed model has been done using the MATLAB function 
mixed.m written by Witkovsky [^0] (more details can be found in Supplementary Data). 
REML estimates for variance components and are displayed in hgure Each 
boxplot corresponds to a different sample size (10% error trials, 90% correct trials) 
and summarizes the distribution of estimates performed on 100 different training sets. 
We note that the magnitude of the residual variance estimates is larger than that 
of f^. Globally the dispersion of the estimates decreases as a function of the sample 
size, nevertheless the hgure shows that good estimates can be obtained even with small 
sample sizes. In hgure (a) the quartiles of the distribution tend to increase slightly 
with the sample size and minimum estimates rise above zero. Using a z-test we test 
that the random ehect variance is strictly positive in average for each training sample 
size (p-value < 10“^). 

4-3. Classification results 

4-3.1. Performance evaluation. In the case of unbalanced classes, the performance 
evaluation methodology has to be chosen carefully as the majority class may shadow 
the behaviour of the minority class. This problem has been discussed in several works 
(see e.g. [ssiEziiin] and references therein). In this paper, we use various performance 
measures for comparing unbalanced EEG dataset classihers. In particular, we focus on 
class-dependent evaluations. 








estimates 
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(a) 


0.25 



20-200 40-400 60-600 80-800 100-1000 120-1200 


(b) 



20-200 40-400 60-600 80-800 100-1000 120-1200 


Figure 7. Covariance components estimation, (a) Random effect variance P 
estimates, (b) Residual variance cr^ estimates. Each boxplot displays the series of 100 
parameter estimates for different training sample sizes (10% error trials, 90% correct 
trials). The central mark of boxplot is the median (second quartile), the bottom and 
top of the box are the first and third quartiles, the whiskers extend to the most extreme 
values not considered outliers, and outliers are plotted individually. 


- Confusion matrix. In a binary classification, given a classifier and a test sample, 
four outcomes are possible: true positive, false positive, true negative and false 
negative. For instance, if the test sample is positive and it is classified as negative, 
it is counted as a false negative. For a test dataset, the number of occurences of the 
four outcomes are respectively denoted by (TP), (FP), (TN) and (FN), and form 
the 2x2 confusion matrix given in table 


Table 2. Confusion matrix for the ErrP classification problem. 


actual correct 
error 


predicted 
corret error 


TN FP 
FN TP 


- Good classification rate. In the particular case of unbalanced data, the good 
classification rate must be calculated for each class independently. Indeed, a global 
good classification rate may lead to misleading results. For example, in our problem, 
where the errors only represent 10% of the data, the naive classification strategy 
allocating all testing data to the majority class would automatically achieve a good 
classification rate of 90%. However the classifier is obviously irrelevant for detecting 
error trials. 

- Pierce’s Skill Score. Given the confusion matrix, the good prevision rate H = 
TP/{FP + TP) and the false alarm rate F = FP/{TN + FP) for the error class 
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lead to the so-called Pierce score 

PSS = H-F. (19) 

By construction, PSS G [—1;-|-1]. In our context the closer PSS is to -|-1, the 
better the classifier is for the detection of error trials. 

4-3.2. ErrP detection. For the considered dataset the number of error trials {N^) 
is much smaller than the number of correct trials (iV°). For performance evaluation, 
unbalanced training sets of increasing size are generated, all based on the a priori 10% of 
errors. For each participant and each randomly drawn unbalanced training set (namely, 
{N^ = 20, = 200} , = 30, = 300},...), the test set is composed of all 

remaining data, therefore its size differs. For each participant this splitting is performed 
100 times and classifiers performance are recorded. 

The proposed method (named LMM hereafter) is compared with several classifiers: 
classical linear discriminant analysis (LDA), regularized method (regularized LDA - 
RDA - as proposed in m) and diagonal discriminant analysis (diagonal LDA (DLDA) 
and diagonal QDA (DQDA)). In addition, LMM is also compared to LDA, RDA and 
DLDA classifiers after sub-sampling the majority class to balance training sets. The 
classic QDA classifier cannot be used, because the sample covariance matrix in error 
class is very often singular in the considered range of sample sizes. Corresponding 
results are displayed in figure Let us stress that the error bars represent the standard 
deviation of the good classification rates distribution. Since the training and test sets 
are complementary, when the size of training set increases, the number of single-trial 
signals used to evaluate the good classification rates decreases, and consequently the 
error bar increases. These error bars only measure the dispersion of the estimates and 
cannot be used for quantitative pairwise comparison (see below). 

Clearly LMM greatly outperforms each method in terms of error detection in 
all situations. This result is particularly spectacular for small training set which we 
interpret as follows. The simplicity of the underlying model (which involves very few 
parameters) and its relevance for the considered dataset yield accurate and robust 
parameter estimates even for small sample size. In addition, our method performs 
equivalently to other methods to detect correct trials which correspond to the majority 
class. Even with small sample size, classifiers detect correct trials with a good 
classification rate close to 100%. 

We provide in table quantitative classifiers comparison based on Pierce’s Skill 
Score (PSS) defined in ( |I^ . For each training set, PSS is computed for each classifier. 
LMM is compared with each other classifier (OC) using Wilcoxon signed rank test 
(a paired non-parametric statistical test, see |23l |26|) based on the average difference 
PSSlmm - PSSoc- 

For participant A, LMM outperforms all the other considered classifiers very 
significantly except DQDA from the sample size (80 — 800). Indeed the DQDA 
performances are equivalent to or significantly better than the LMM ones for the largest 


Detecting EEC Evoked Potential using a Wavelet Domain Linear Mixed Model 


22 


sample sizes. Quantitative results for all participants can be found in Supplementary 
Data. The conclusions drawn for participant A are confirmed in most configurations but 
should be modulated as follows. In most situations LMM performs consistently better 
than the other classifiers. This in particular true for small training sets namely when the 
number of errors is in the range (20, 30, 40, 50), where the superiority is supported by 
statistical tests (with only two exceptions where results are equivalent). These results 
can be explained by the preprocessing which enforces the data decorrelation and are 
coherent with the results of Box’s M test (see section 4.2.1). 

The proposed method is therefore a valuable alternative to linear discriminant 
techniques which involves quadratic decision rule avoiding the difficulty of usual 
shortcomings of classical QDA. 


(a) (b) (c) 





# training samples (error - correct) 



(e) 


(f) 






LMM - Err 

— — LMM - PC 
RDA sub - Err 

— — RDA sub - PC 

20-200 40-400 60-600 80-800 100-1000120-1200 



Figure 8. Good classification rates for different training sample sizes {error — 
correct). Results correspond to good classification rates averaged over 100 iterations 
for each training sample size. Vertical bars represent standard deviations (std). Results 
are given for error (plain lines) and correct (dotted lines) test trials. LMM classifier is 
compared with three different classifiers on the unbalanced and re-balanced datasets, 
(a), (b), (c) display results in the unbalanced case (10% error trials) for classic LDA 
(LDA), Regularized LDA (RDA) and Diagonal LDA and QDA (DLDA & DQDA) 
respectively, (d), (e), (f) display results of the LDA, RLDA and DLDA classifiers after 
re-balancing datasets by subsampling the correct class. 


Remark 4 As mentioned above, the correct classification rate depends heavily on the 
quality of the estimation of the covariance matrices. In our case, the latter are strongly 
simplified, and the issue is the quality of the estimation of the mean vectors /i'’ and 
the variances and Thanks to pre-processing and dimension reduction, variances 
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Table 3. Participant A: Classifiers comparison based on PSS. LMM performances 
are compared with other classifiers using the Wilcoxon signed rank test. For each pair 
of classifiers LMM-other classiher, the difference between the two mean PSS is given 
and p-value order of magnitude is provided between parentheses. 


N^-N° 



Classifiers 

comparison 


LMM-LDA 

LMM-RDA 

LMM-DLDA 

LMM-DQDA 

20-200 

0.14 

(10-12) 

0.36 

(10-1®) 

0.37 

(10-1®) 

0.41 (10-1®) 

30-300 

0.34 

(10-1®) 

0.34 

(10-1®) 

0.32 

(10-1®) 

0.32 (10-1®) 

40-400 

0.35 

(10-1®) 

0.30 

(10-1®) 

0.25 

(10-1®) 

0.21 (10-1®) 

50-500 

0.28 

(10-1®) 

0.25 

(10-1®) 

0.16 

(10-1®) 

0.09 (10-1®) 

60-600 

0.25 

(10-1®) 

0.23 

(10-1®) 

0.13 

(10-1®) 

0.07 (10-1®) 

70-700 

0.21 

(10-1®) 

0.19 

(10-1®) 

0.10 

(10-1®) 

0.03 (10-1) 

80-800 

0.18 

(10-1®) 

0.17 

(10-1®) 

0.08 

(10-17) 

0.009 (0.43) 

90-900 

0.14 

(10-1®) 

0.13 

(10-1®) 

0.06 

(10-1®) 

-0.01 (10-®) 

100-1000 

0.12 

(10-1^) 

0.11 

(10-17) 

0.05 

(10-1°) 

-0.02 (10-1) 

110-1100 

0.09 

(10-12) 

0.09 

(10-1®) 

0.04 

(10-7) 

-0.02 (10-1) 

120-1200 

0.07 

(10-1) 

0.08 

(10-®) 

0.04 

(10-2) 

0.005 (0.23) 


are estimated from KJ{N^ + samples, while mean vectors are estimated from 
N'^ samples. Hence the qnality of the classihcation is mainly driven by the quality 
of the estimation of the mean vectors. Since we consider nnbalanced sitnations, the 
difhcnlt class is the minority class (in onr case the error class) for which approximately 
30 samples or more are needed to get more than 60% correct classihcations. The other 
class being ten times bigger, the corresponding classihcation rate is close to 100%, and 
the global classihcation rate is satisfactory. 

Similar resnlts nsing diherent Danbechies hlters are presented in Snpplementary 
Data. Whatever the hlter, the proposed method consistently ontperforms concurrent 
approaches in terms of classihcation, best results being obtained when the Haar hlter is 
used. 

4.4- Back to time courses: adjusted single-trials 

As described in subsection |3.5[ for each single-trial an estimate of the signal of interest 
can be obtained according to After back projection onto sensor space (using a 

Moore-Penrose pseudo-inverse) and inverse DWT we obtain time courses called adjusted 
single-trials. Examples of adjusted single-trials (electrode FCz) are displayed in hgure[^ 
These plots are obtained using the D6 Danbechies hlter. Adjusted single-trials obtained 
using other Danbechies hlters are displayed in Supplementary Data. 

The left hand plot (a) represents the raw signals averaged over trials (within each 
class) together with the averages of adjusted single-trials. As can be seen for each class 
the two averages are very close to each other for participant A. For other participants, 
this similarity is less striking but still present. This shows that neither dimension 
reductions nor modelling have strongly ahected the average information. Although 
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not surprising and even expected, the recovery of the global shape of the average is 
a first necessary step before looking at single-trial reconstruction. It appears that the 
modelling indeed captures the signal of interest and filters out the background activity. 

As largely reported in the literature (see HU for an overview), in the error class, 
the estimated averages as well as the adjusted single-trials show a clear negative wave 
around 150 ms after EMG onset followed by a positive one. Such a negative wave is not 
observable in the correct class, as commonly reported with monopolar recordings. 

The added value of the model is the explicit description and the numerical 
quantification of the inter-trial variability. The latter appears clearly in the right hand 
plot of figure |^b) where adjusted single-trials are displayed. The proposed method 
therefore allows one to reveal the amplitude variation of the ErrP: while some errors 
induce a very large activity, some others present virtually no response. The same figure 
is displayed for each participant in Supplementary Data. 

Although speculative at this point, this could reveal differences in sensitivity to the 
errors across trials, an hypothesis that now becomes testable thanks to the proposed 
method. 



Figure 9. Illustration of single-trial reconstruction, (a) Error and correct trial 
averages on FCz for participant A. Dashed curves: averages across raw trials in both 
classes; full curves: averages across adjusted trials, (b) Adjusted single-trials for Err 
(red) and PClblue) classes on FCz. 


5. Discussion 

We first highlight the key points and main contributions of our method and then discuss 
other possible applications of this work. 

5.1. The impact of time-domain decorrelation 

Within the proposed procedure, we would like to emphasize that the introduction 
of DWT constitutes a key step for the analysis and the classification of single-trials. 
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Considering mnltiscale coefficients instead of time-point valnes first gives a simplified 
representation of data, bnt it also has important conseqnences for both modelling and 
classification. Indeed, decorrelation and dimension rednction lead to small size, diagonal 
dominant covariance matrices, easier to estimate. Conseqnently, in the wavelet domain 
single-trials can be modelled by simple linear mixed model with a small nnmber of 
nnknown parameters, allowing robnst estimates from small nnbalanced training set and 
good classification results. 

5.2. Eine structure of the model 

Mixed model provides a general and flexible approach to analyze single-trial EEC 
signals. It allows one to include both fixed and random effects within the same analysis. 
In the case of EEC datasets it appears natural to consider trial effects as random 
effects since they can be seen as originating from a random selection from a much larger 
set. Furthermore, it is of direct relevance to separate inter and intra-trial variability. 
However for each problem and dataset of interest, the main difficulty lies in the choice 
of the dimension of the random effect vector, its class-dependence and the covariance 
structure. 

For ErrP detection problem, the training sample size being small, we choose 
the simplest possible linear mixed model: class as fixed effect (class-average), one 
dimensional random effect vector (class-independent) modulated by a class-dependent 
coefficient vector E'^ plus a white noise. The choice of T^ depends on the dataset and 
can be done using prior information or exploratory data analysis. In our application the 
observed relationship between amplitude and variability led us to set T'^ equal to the 
average over trials in each class. 


5.3. Relevant classification results 


In section |4.3[ we provided a systematic comparison of related classification techniques, 
all based upon linear models, involving the same hxed part (class-average) and various 
approaches to model variability in the unbalanced situation. All methods were applied 
after the same preprocessing step, that enforces diagonal dominance for covariance 
matrices. Consequently, methods that can exploit such diagonal dominance properties 
are particularly relevant. As expected, results show that they signihcantly outperform 
classic LDA or QDA in terms of classihcation. Moreover our classihcation results 
highlight two important points. The first one is that our procedure is particularly 
efficient when the size of the training sample is small, emphazising the relevance of 
the proposed variability modelling through the linear mixed model. Secondly, beyond 
a certain sample size, namely when the number of observations permits to estimate 
precisely the diagonal of the covariance matrix in the minority class, diagonal QDA 
outperforms all considered methods. That point is coherent with the results of the 
Box’s M test of equality of class-covariance matrices. For all participants, when the 
number of trials is sufficient to calculate the test statistic. Box’s M test rejects the 
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hypothesis of equality (p-value < 10“^). Unlike the balanced case where LDA generally 
outperforms QDA even when class-covariance matrices are different, diagonal QDA gives 
better results than diagonal LDA. In the unbalanced case, we stress that it is preferable 
to take into account the difference in variability of both classes. However it is well known 
that the classiher performance depends on the accuracy of the estimate of parameters 
involved in the procedure, and to obtain accurate parameter estimates the number of 
samples must be much larger than the number of parameters. Therefore when the 
sample size is small, modelling the intra-class variability is well suitable. In our model, 
in addition to the class average estimates, only two parameters are needed to estimate 
the two class-covariance matrices. 


5.4- Possible applications 

Being able to exploit the single-trial M/EEG component would open a new window on 
the dynamics of brain processes, and the range of applications is wide. Although some 
might be obvious, let us focus on some areas where the proposed approach might be 
especially useful. 


5.4-1. BCI. In the present application, we considered a discriminant framework. In 
this context, BCI naturally becomes a straightforward extension for the proposed 
LMMC procedure. Indeed, classihcation results, presented in section 4.3.2, illustrate 
the performance of single-trial variability modelling on classifying error vs correct 
trials. Importantly, the total variance explicit modelling, combined with spatial and 
temporal dimension reductions, allows robust parameters estimates, even when based 
on a limited training dataset. Sticking to a small training dataset is essential to 
keep the BCI calibration session as short as possible, a condition that can be reached 
thanks to the simplicity of the proposed model. The procedure, as described above, 
is not restricted to error potential estimate, however, and can easily be extended 
to other BCI protocols. Indeed, a strength of LMMC method is the modularity 
and interchangeability of the preprocessing steps. Concerning the spatial dimension 
reduction, well established spatial hltering methods have been developed for specihc 
BCI tasks: for example Common Spatial Pattern (CSP) are especially suited spatial 
Liters for motor imagery [SU] whereas xDAWN has been developed for P300 data |45] . 
Those methods can easily be implemented in the LMMC procedure while leaving the 
modelling step almost unchanged. The same holds for temporal dimension reduction. 
In the current application, to get rid of temporal correlation and for temporal dimension 
reduction, we used wavelet transform which is well suited to extract transient phenomena 
such as ERP’s (ErrP, P300, etc.). However, the choice of the basis can depend on the 
data to analyze, and for more oscillatory signals such as the ones recorded in motor 
imagery (like /i and 13 rhythms, see [35]), other transforms such as time-frequency 
transforms (MDCT, STFT, see [35|) or adaptive transforms [531 I3| might be more 
suitable. Again, changing the transform leaves the variance modelling steps almost 
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unchanged, as long as the transform is invertible. 

5.4-2. Single-trial extraction in cognitive neurosciences. In many experimental 
situations, being able to extract the single-trial amplitude of brain activity would open a 
new window on the dynamics of brain processes. One example is the temporal evolution 
of brain activity during learning: across repetitions of the learning situation, some 
brain regions might become more active, while some other might disengage from the 
task. Relying on averaging, at best, dramatically reduces the analysable temporal 
dynamic of the learning process. For example, while no clear learning effect could 
be evidenced on the raw Auditory Evoked Potentials (AEP) in rats, extracting the 
single-trial component (through denoisig in this case) allowed to regress learning curves 
on the trial-by-trial responses, and show that the amplitude of some components of 
the AEP reduces with learning while other remain unchanged |16]. Averaging also 
prevents analyzing correlations between brain activities and behaviour across trials. 
For example, RT or psychophysical judgements are known to be highly variable even 
for the very same stimulation. The variability in RT have long be known to provide 
essential information to put into test different model of information processing [M] . 
Unfortunately, the equivalent variability is normally not accessible for EEG/MEG brain 
responses, preventing the use of similar constraints on models based on brain activity 
(see e.g. [22], for trial-to-trial variability to constraint RT models). The LMM approach 
furnishes a way to recover this variability (at least in amplitude) and may help to 
correlate brain activities to behaviour. Further applications will probe what type of 
constraint such variability can bring for model testing. 

Another key interest of the proposed approach is the ability to better analyze 
unbalanced data, which is a quite common situation. Indeed, from a statistical point of 
view, LMM allows increasing the robustness of the estimates from small datasets and 
hence reducing the risk of type II error (false negative), which is a common problem in 
small and noisy dataset. It may also help to decrease the number of necessary trials to 
reach a given statistical power, which can be of tremendous interest on some populations 
(children, patients, etc.). 

One may wonder, however, how general this methodology can be and whether it 
would apply with similar success on other brain activities which might not be as stable 
as the ErrP. Although we do not have any empirical response, formal analysis of the 
model allows speculating a bit. In the current modelling, the class average plays a 
central role. One can therefore anticipate that the quality of the modelling will largely 
be driven by the quality of the average. As a consequence, as long as the average emerges 
from the background noise, that is, as long as there is an ERP, it should be possible 
to apply the model. For example, it should be possible to apply the method on robust 
components, such as the standard visual evoked potentials (Nl, P2, N2 etc.). This will 
be investigated on future research. 
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6. Conclusions and future works 

In the present work we propose a single-trial classification procedure (LMMC) based 
on an explicit model of the signal variability in the wavelet domain. Each single-trial 
signal is assumed to be the sum of two components: a background activity and a signal 
of interest. The latter can again be decomposed into the sum of a fixed part (the 
class-mean) and a random part that models the deviation of each single-trial from the 
class-mean. This hypothesis is formalized mathematically as a Gaussian linear mixed 
model (LMM). This model generalizes the Gaussian linear model given in which 
was adapted to situations where between-trial variability is negligible. An important 
step is the discrete wavelet transform in the preprocessing. On the one hand, this 
transformation allows one to reduce the time dimension and on the other hand, to 
consider a very simple model where class-covariance matrices are characterized by only 
few parameters (to be estimated). The advantage of such a model is to yield robust 
estimates of parameters from a small dataset. 

This model is applied to a classification problem, namely separating correct from 
erroneous responses in a RT task. As expected, the proposed classification procedure is 
particularly efficient to detect EEG Error Potentials in the unbalanced case when the 
sample size of the minority class is small. Moreover these results validate the chosen 
modelling. 

In addition, the proposed procedure allows the extraction of relevant and 
discriminant information from EEG signals. As a matter of fact model parameters 
estimates provide quantitative measures of the between-trial variability in each class in 
addition to the average behaviour. In the specific context of error potential detection, the 
proposed model allows one to recover single-trial signal of interest and more interestingly 
reveals the trial-by-trial amplitude variation of ErrP’s. 

Several directions might be interesting to investigate further and we list a few of 
these below. 

First, the choice that was made for random part coefficient vector P'^ explicitly assumes 
a linear relationship between the signal of interest’s amplitude and variability. However, 
although the existence of a monotonic dependence between mean and variance seems 
a reasonable assumption, the relationship needs not be linear and is likely to be more 
complex (for example, class dependent). Further investigating and characterizing the 
link between mean and variance should improve the choice of the random component 
and hence the modelling performance. 

Another improvement would be to better take into account the cubic structure of 
M/EEG data (he space x time x trials). Decoupling spatial and temporal features 
extraction [53] could lead to a more complex form for the random effects design matrix 
and hence to a spatio-temporal modelling of between-trial variability. 

Gomparing the EEG error potentials across participants, not only on this dataset 
but in the literature, see e.g. jlHl figure 3], suggests large similarities between 
participants, but also some differences, both in terms of topography and time courses. 
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One might therefore consider incorporating also the between participant variability 
within a more sophisticated mixed model, which would then model the deviation of 
the participant (in time and in topography) to the mean behaviour. The model would 
include common settings and parameters specihc to each subject. From a statistical 
point of view, the introduction of such common settings would presumably increase the 
robustness of the estimates from small datasets. This would in particular allow a more 
robust estimate of the single participant topography and time course. The question as 
to whether such differences in topography is related to hne anatomical differences HU 
would become easier to test, especially when few trials are available. 
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